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Abstract 



Switching state-space models (SSSM) are a very popular class of time series models that have found 
many applications in statistics, econometrics and advanced signal processing. Bayesian inference for these 
models typically relies on Markov chain Monte Carlo (MCMC) techniques. However, even sophisticated 
MCMC methods dedicated to SSSM can prove quite inefficient as they update potentially strongly 



correlated discrete- valued latent variables one-at-a-time (Carter and Kohn, 1996 Gerlach et al., 2000 



Giordani and Kohn, 20081. Particle Markov chain Monte Carlo (PMCMC) methods are a recently 



developed class of MCMC algorithms which use particle filters to build efficient proposal distributions 



in high-dimensions (Andrieu et al., 20101. The existing PMCMC methods of Andrieu et al. (20101 are 
applicable to SSSM, but are restricted to employing standard particle filtering techniques. Yet, in the 
context of discrete-valued latent variables, specialised particle techniques have been developed which can 



outperform by up to an order of magnitude standard methods (Fearnhead, 1998 Fearnhead and Clifford, 



2003 Fearnhead, 2004 1. In this paper we develop a novel class of PMCMC methods relying on these very 
efficient particle algorithms. We establish the theoretical validy of this new generic methodology referred 
to as discrete PMCMC and demonstrate it on a variety of examples including a multiple change-points 
model for well-log data and a model for U.S. /U.K. exchange rate data. Discrete PMCMC algorithms 
are shown to outperform experimentally state-of-the-art MCMC techniques for a fixed computational 



complexity. Additionally they can be easily parallelized (Lee et al., 2010 1 which allows further substantial 
gains. 

Keywords: Bayesian inference, Markov chain Monte Carlo, optimal resampling, particle filters, se- 
quential Monte Carlo, switching state-space models. 
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1 Introduction 



Linear Gaussian Switching State-Space Models (SSSM) are a class of time series models in which the pa- 
rameters of a linear Gaussian model switch according to a discrete latent process. They are ubiquitous in 



statistics ( 


Cappe et al., 2005 


et al., 2007 


) and advanced sig 



to describe in a compact and interpretable way regime switching time series. SSSM have been successfully 



used to describe, among others, multiple change-point models (Fearnhead and Clifford, 2003 Giordani and 



Kohn, 2008), nonparametric regression models with outliers (Carter and Kohn, 1996) and Markov switching 



autoregressions (Billio and Monfort, 1998 Friihwirth-Schnatter, 2006 Kim and Nelson, 1999). 

Performing Bayesian inference for SSSM requires the use of Markov chain Monte Carlo (MCMC) tech- 
niques. The design of efficient sampling techniques for this class of models has been a subject of active 
research for over fifteen years, dating back at least as far as Carter and Kohn (1994[); Shephard (1994[). A 
recent overview of MCMC in this context can be found in Cappe et al. (2005[ ); [Friihwirth-Schnatter (2006[ ) . 
The main practical difficulty lies in simulating from the conditional distribution of the trajectory of the 
discrete-valued latent process. The cost of computing this distribution grows exponentially in the length of 
the observation record and therefore obtaining an exact sample from it is impractical for all but tiny data 
sets. A standard strategy is instead to update the components of the discrete latent process one-at-a-time 
(Carter and Kohn, 1996 Gerlach et al., 2000 Giordani and Kohn, 2008 1. However, it is well-known that such 
an approach can significantly slow down the convergence of MCMC algorithms. An alternative is to sample 
approximately from the joint distribution of the latent discrete trajectory using particle filters: non-iterative 
techniques based on a combination of importance sampling and resampling techniques, see [Doucet et al.| 
p00l| ); [Lhr(200l| ) for a review of the literature. Empirical evidence suggests that particle filters are able to 
provide samples whose distribution is close to the target distribution of interest and this evidence is backed 
up by the rigourous quantitative bounds established in Del Moral (2004[ chapter 8). This motivates using 
particle filters as proposal distributions within MCMC. 

This idea is very natural, but its realization is far from trivial as the distribution of a sample generated 
by a particle filter does not admit a closed-form expression hence preventing us from directly using the 
standard Metropolis-Hastings (MH) algorithm. In a recent paper Andrieu et al. (2010[ ) have shown that 
it is possible to bypass this problem. The authors have proposed a whole class of MCMC algorithms 
named Particle MCMC (PMCMC) relying on proposals built using particle filters. These algorithms have 
been demonstrated in the context of non-linear non-Gaussian state-space models and are directly applicable 
to SSSM; see also Flury and Shephard (20To| for applicat ions in financial econometrics. However, the 
standard particle methods employed in Andrieu et al. (2010 1 do not fully exploit the discrete nature of the 
latent process in SSSM. This was recognized early by Paul Fearnhead who proposed an alternative generic 
algorithm, which we refer to as the Discrete Particle Filter (DPF) (Fearnhead, 1998 1. The DPF bypasses the 
importance sampling step of standard particle techniques and can be interpreted as using a clever random 
pruning mechanism to select support points from the exponentially growing sequence of discrete latent state 



spaces. The DPF methodology has been demonstrated successfully in a variety of applications (Cappe et al., 
2005 Fearnhead, 1998 Fearnhead and Clifford, 2003 Fearnhead, 2004). It has been shown to significantly 



outperform alternative sophisticated approaches such as the Rao-Blackwellized particle filters developed in 
Chen and Liu (2000[ ); Doucet et al. (2000{ 2001) by up to an order of magnitude for a fixed computational 



complexity. 

The main contribution of this article is to propose a novel class of PMCMC algorithms referred to as 
discrete PMCMC methods relying on the DPF for this important class of statistical models. The practical 
efficiency of the proposed methods relies on an original backward sampling procedure. We show that on a 
variety of applications this new generic methodology outperforms state-of-the-art MCMC algorithms for a 
fixed computational complexity. Moreover, as in the case of standard particle filters (Lee et al., 2010), the 
DPF can be parallelized easily. This suggests that even greater computational gains can be achieved. 

The rest of the paper is organised as follows. In Section[2]we present the general class of SSSM considered 
in this paper and give an illustrative exa mple. In Section [3| we discuss the intractability of exact inference 
in SSSM and present the DPF algorithm ( [Fearnhead, 1998] [Fearnhead and Clifford, 2003} [Fearnhead, 2004 ). 
Our presentation is slightly non-standard and explicitly introduces the random support sets generated by the 
algorithm. This allows us to describe the DPF precisely and compactly in a probabilistic way which proves 
useful to establish the validity of the proposed algorithms. We also review standard MCMC techniques used 
in this context. In Section [4[ we introduce discrete PMCMC algorithms relying on the DPF to perform 
inference in SSSM and present some theoretical results. In Section [5| we review generic practical issues and 
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demonstrate the efficiency of the proposed methods in the context of three examples. Finally in Section [6] 
we discuss several extensions of this work. 



2 Switching state-space models 
2.1 Model 

From herein, we use the standard convention whereby capital letters are used for random variables while 
lower case letters are used for their values. Hereafter for any generic process {zn} we will denote := 
{zi, Zi+i, . . . , Zj). The identity matrix of size p is denoted Ip and the matrix of zeros of size p x q by Opxq- 

Consider the following SSSM, also known in the literature as a conditionally linear Gaussian state-space 
model or a jump linear system. The latent state process {^n}„>i is such that Xn takes values in a finite 
set X. It is characterized by its initial distribution Xi ^ I'e {■) and transition probabilities for n > 1 

-'^riK^liri-l = a^lin-l) fo {■\xi:n-l) ■ (1) 

Conditional upon {^n}„>i! we have a linear Gaussian state-space model defined through Zq ^ A/'(mo, Eo) 
and for n> 1 

Z„ = A9(X„)Z„_i + Bg{X^)Vn + Fe(X„)u„, (2) 
y„ = Ce{X^)Zn + D0{Xn)W^ + Ge(X„)u„, (3) 

where J^{m^Ti) is the normal distribution of mean m and covariance S, Vn A/'(Oi,xi, /t,), 
A/'(Ou.xi, ^lo), {A0{x),Bg(x),Cg{x),D0{x),F0(x),Ge(x)]x £ X} are matrices of appropriate dimension and 
Un is an exogeneous input. Here G is some static parameter which may be multidimensional, for example 
C K''. For purposes of precise specification of resampling algorithms in the sequel and without loss of 
generality we label the elements of X with numbers, for example X — {1, \X\} for some \X\ e N. We may 
then endow each Cartesian product space X^,X^, ... with the corresponding lexicographical order relation. 
From henceforth, whenever we refer to ordering of a set of points in A"" it is with respect to the latter 
relation. 

We give here a simple example of a SSSM. Two more sophisticated examples are discussed in Section |5] 
2.1.1 Example: Auto-regression with shifting level 

Let X = {0, 1} and for {X„} a Markov chain on X with transition matrix Px, consider the process defined 

by 



+ CrXnVn.2, 

where for each n > 1, fin and Yn are real-valued and {Vn,i} and {^1,2} are i.i.d. 7V(0, 1). The initial 
distribution on /io is A/'(too,o'q) and is assumed known. This is a natural generalization of a first order 
autoregressive model to the case where the level /i„ is time-varying with shifts driven by the latent process 
{X„}. This model can be expressed in state-space form by setting 



Yn fJ'n 



A0{Xn) = 



(j) 

1 



Va;„, 



B0{Xn) 







Ce(x„) = [ 1 1], D0{Xn) ^ F0{Xn) = G0{Xn) ^ 0, Vx^ 



The unknown parameters of this model are 9 — [cf) a'^ Px] where Px is the transition matrix of {X„}. In this 
model and more generally in SSSMs, inferences about the latent processes {nn} and {Xn} from a particular 
data set are likely to be highly sensitive to values of these parameters if they are assumed known. 
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2.2 Inference aims 



Our aim is to perform Bayesian inference in SSSMs, conditional upon some observations yi-^ and for some 
T > 1, treating both the latent trajectories Xi-x, Zq-x and the parameter as unknowns. Where applicable, 
the values of the input sequence ui-t are assumed known, but for clarity we suppress them from our notation. 
We ascribe a prior density p [6) to so Bayesian inference relies on the joint density 

p{9,Xi,T,ZQ,T\yi:T) OCpe {xi;T,ZQ:T,yi:T)p{S) , (4) 

where the definition of po (a;i:T, -^OiT, 2/1:t) follows from Eq. ([TJ-([2])-([3|. This posterior can be factorized as 

follows 

p{0,Xl.,T,ZO:T\yi:T) ^ P{0,Xl.,T\yi:T)Pe {zO:T\yi:T , X1:t) (5) 

where 

fn I A Pe{yi:T\xi:T)p{xi:T\d)p{0) , 

P[d,xi.,T\yi:T) = -j^-^ , 1 , s . , — Mwm:^^- 

Conditional upon Xi-^t = xi-t, Eq. ([2|-([3| define a linear Gaussian state-space model so it is possible to 
compute efficiently the statistics of the conditional multivariate Gaussian density pg {zq-.tI yi-.T, xi-.t) in Eq. 
([5| and the conditional marginal likelihood pg (j/i:t| Xi-t) in Eq. (|6]) using Kalman techniques. For example 
Pe (j/IitI xi;t) can be computed using the product of predictive densities 

T 

Pe{yi:T\xi:T) ^ 'Wgg{yn\yi:n~l,Xl..n) (7) 

n=l 

where yi-Q :— 0. The statistics of these Gaussian predictive densities can be computed using the Kalman 
filter which is recalled in Appendix |X] for sake of convenience. For simplicity of presentation throughout the 
following we assume that for each 1 < n < T and 6* € 8 the support oipg {xi-n \yi:n ) is A"". This assumption 
is satisfied in the vast majority of cases considered in practice and in all the examples we consider. The 
techniques discussed below can be transferred to cases where this assumption is not met with only cosmetic 
changes. 



3 Inference techniques for switching state-space models 
3.1 Exact Inference and Intractability 

The main difficulty faced in the exact computation of p[9^xi;T\yi:T), is the need to perform the summation 
in the denominator of Eq. ([6]) over up to \X\'" values of xi-y, where \X\ is the cardinality of X . For even 
modest values of T, this sum is too expensive to compute exactly. In the applications we consider, T is of 
the order of thousands, so exact computation is practically impossible. 

Even if 6 is treated as fixed, inference is intractable. In this case, we wish to compute pe{xi;T\yi:T), 
whose normalization involves the same problematic summation. One approach is to obtain pg{xi;T\yi:T) by 
sequential computation of pg{xi\yi) , pg{xi.2\yi:2) i ■■■ via the recursive relationship 

n\yi:n—l ; ^l:n )fg{Xn\xi:n-l)Pe{xi:n-l\yi:n-l) 
P9\Xl:n\yi:n) — ■r-^ r , \ f I ^' 

9e[yn\yi:n-l,Xl:n)Je[Xn\Xl-.n-l)Pe{Xl-.n-l\yi-.n-l) 

(with gg{yn\yi:n-ii xi;n) the predictive as defined in the previous section) but the computation involved 
increases exponentially in n. For purposes of exposition in the sequel, we remark that, as for each n the 
support oi pg{xi;n\yi:7i) is X'"' then the sequence of such supports satisfies the trivial recursion 

X'' = Xx 

and is evidently growing in cardinality with n. Hence, in both the cases of computing p{0, xi^xlyi-.T) and 
Peixi-.Tlyi-.r) it is necessary to rely on approximations and we focus here on Monte Carlo methods. 
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3.2 Monte Carlo Methods 



We next review two classes of Monte Carlo techniques to perform inference in SSSM. The first method we 



discuss is the DPF algorithm of Fearnhead (19981. For a fixed parameter value 6*, this algorithm allows 
us to compute an approximation of the posterior distribution pg {xi-tIiji-.t) and an approximation of the 
marginal likelihood pe {yi-.r)- We present this algorithm in a slightly non-standard way which allows us to 
describe it probabilistically in a concise and precise manner. This will prove useful for the development of 
the discrete PMCMC algorithms in Section [4] We also review MCMC methods which have been developed 
to approximate p {0, xi;t, zq-tIui-.t) and discuss their advantages and limitations. 

3.3 The discrete particle filter 



The DPF algorithm proposed in|Fearnhead (19981; Fearnhead and Clifford (2003 1 is a non-iterative procedure 



approximating the posterior distribution pg {xi-.tIvi-.t) and the marginal likelihood pg {yi-r)- Practically, the 
DPF approximation of the posterior distributions {pg {xi;n\yi:n) > 1} is made sequentially in time using 
a collection of [A"! weighted trajectories or "particles" i = l,...,Af|A'||, 

N\X\ N\X\ 

(xi:„|yi:„) = ^ (xi.„) , < [x'Cl) > 0, ^ (xj;) ) = 1. 

i=l i=l 

The parameter N controls the precision of the algorithm. The larger it is, the more accurate (on average) 
the approximation of the target distribution. It has been demonstrated experimentally in |Cappe et al.| 
|(2005[ ); [Fearnhead and CUfi'ord (2003| |; [Fearnhead (2004[ ) that the DPF algorithm out performs significantly . 



sometimes by one order of magnitude, the Rao-Blackwellized particle filters proposed in Chen and Liu (2000| ; 



[Doucet et al. (2000 [2001 1 and that it is able to provide very good approximations of pg (xi^tIvi-.t) in realistic 
scenarios even with a moderate number of particles. The action of the DPF can be summarised as follows. 

Assume that we have, at time step n obtained p^ {xi:n\yi:n) consisting of A^jA"! distinct particles with 
weights that sum to 1. A resampling step is then applied, exactly N of the N\X\ trajectories survive and 
their weights are adjusted accordingly. The resampling mechanism is chosen in such a way as to be optimal 
in some sense. Throughout the remainder of the paper we treat the case of minimising the sum of variances 
of the importance weights as in Fearnhead and Cl ifford (2003[ ) but exactly the same method applies to other 



schemes discussed in |Barembruch et al. (2009 j . Features of this resampling scheme which distinguish it from 
standard methods, such as multinomial resampling, are that it results in no duplicated particles and gives 
post-resampling weights which are non-uniform. 

Whereas standard particle methods rely on a stochastic proposal mechanism to explore the space, the 
DPF performs all its exploration deterministically. This is possible because of the finite cardinality of the 
latent discrete space. Consider one of N particles which survived the resampling operation, each of which is a 
point in A"". Call the point in question xi-n and denote by in^^^{xi:n) and S^^'|^(a;i:„) respectively the mean 
and covariance of the Gaussian density pg{zn\yi:n, Xi-n)- From this point \X\ new particles {{xi-n, x)]x (z X} 
are formed, and for each one of them, m^'^ , , , , (xi ■„ , x) , S^'* , , , Ax-\-„,x) and the associated unnormalized 

' ' n+l|n+l^ n+l|?^+l^ j-.t, / 

weight are calculated using the Kalman filtering recursions (included for reference in Appendix [A|) . This 
procedure is repeated for the remaining — 1 particles, resulting in A^lAj weighted trajectories. The 
weights are then normalized to yield a probability distribution constituting p^ (a;i:n-i-i|yi:n+i)- 

This outline of the DPF operations highlights the function of the resampling step: in the case of the 
DPF it acts to prune the exponentially growing (in n) tree of possible paths {xi;n S X'^;n = 1,2, ...}. It is 
convenient to specify the DPF in a slightly non-standard way which highlights that the only randomness in 
this algorithm arises from the resampling step. To this end, we introduce random support sets Si, S2, St 
with each S„ taking a value s„ which is a subset of X^. It is stressed that, in the following interpretation, 
the xi;n's are not random variables, and are just points in the state space (and Cartesian products thereof) 
used for indexing. With this notation, we write the DPF approximation for n > 1 as 

{Xl..n\yi:n) = ^ ^ (^'l:n) S.^ Jx^.,,,) . (8) 

Under the probability law of the DPF algorithm, which we discuss in more detail later, for each n > 2, 
|S„| = N\X\ A \X'" \, with probability 1. We thus see in Eq. ([8| the effect of the parameter N: it specifies the 
number of support points of the approximation p^ {xi:n\yi:n)- We next provide pseudo code for the DPF 
algorithm and then go on to discuss several issues related to its practical use and its theoretical representation. 



5 



DPF algorithm 

At time n = 1 

• Set Si = X and for each xi E X, compute m^j^(a;i) , and ge{yi\xi) using the Kalman filter. 



Compute and normalise the weights. For each xi G X, 



ufi{xi) = i.e{xi)9e{yi\xi), (x^) = "^'^'J (9) 

At times n = 2, ...,T 

• If |S„_i| < N set C„_i = oo otherwise set C„_i to the unique solution of 

^ lAC„_lW^„^_l(xi:„_l) =iV. 

a;i;„_iGS„_i 

• Maintain the Ln-i trajectories in S„_i which have weights strictly superior to 1/C„_i, then apply the 
stratified resampling mechanism to the other N \X\ — L„_i trajectories to yield N — i„_i survivors. Set SJ^_2 
to the set of surviving and maintained trajectories. 

• Set S„ = S;_i X X. 

• For each xi,,, G S„, compute rn^^^^{xi,n) , i;^';'|^^(xi:„) and ge{yn\yi:n-i,xi.,n) using the Kalman filter. 

• Compute and normalise the weights. For each xi-n E S„, 

ixi.,n) = fe{Xn\xi:n-l)9e{yn\yi:n-l,Xi.,„) ^"'^^^'"7^'' ^' (^'^) 

1 A Cn-lW^_i (Xi,n-l) 

l^x[.^eS„'^n \Xl:n) 



3.3.1 Exact computation at the early iterations 

For small n it is practically possible to compute peixi;n\yi:n) exactly. It is only once n is large enough that 
I A"" I > N that we need to employ the resampling mechanism to prune the set of trajectories. This action 
is represented conceptually in the DPF algorithm above by the artifice of setting C„ = oo if n is such that 
|S„_i| < N. When this condition is satisfied, the resampling step is not called into action. Of course in the 
practically unrealistic case that \X'^\ < N the DPF. unlike standard SMC algorithms, thus reduces to exact 
recursive computation of {peixi;n\yi:n)', n = 1, ...,T}. 



3.3.2 Computing C„ and stratified resampling 



The threshold C„ is a deterministic function of the weights {W^ {xi:n)}^_^ gg • A method for solving 
J2xi.„es^ 1 A CnW^ {xi:n) = N \s givcu in jFearnhead and Clifford (2003 1. The stratified resamphng mech- 
anism, which is employed once C„ has been computed, proceeds as tollows at time n; this was originally 
proposed in Carpenter et al. (1999[); Kitagawa (19961, although not in the context of the DPF. 



Stratified resampling 

• Normalise the weights w^^^i {xi;n-i) of the \X\ — -L„_i particles and label them according to the order of 
the corresponding xi;n-i to obtain W^^_i ; « = 1, N \X\ — L„_i. 

• Construct the corresponding cumulative distribution function: for i = 1, N \X\ — L^-i, 

• Sample Ui uniformly on [0, 1/{N — L„_i)] and set Uj = Ui + j^l^^ ^ for j = 2, N — Ln~i- 

• For i = 1, ...,7V lA"! - L„_i, if there exists j e {1, ...,N - L„_i} such that Qi_i{i - 1) < Uj < Qfj_i(«), 
then x\.j^_-^ survives. 
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3.3.3 Computational Requirements 



Assuming that the cost of evaluating fe{xn\xi:n-i) is 0{1) for all n, the computational complexity of the 
DPF is 0(1 A" I TV) at each time step due to the propagation of N\X\ Kalman filtering operations and the 
generation of a single uniform random variable. The parallelisation techniques described in Lee et al. (2010[ | 
could readily be exploited when performing the Kalman computations. 



3.3.4 Estimating pe {ui-t) 

Of particular interest in the sequel is the fact that the DPF provides us with an estimate of the marginal 
likelihood pg (yi-.x) given by 

T 

n |yi:n— 1 ) (12) 

n=2 

where 

Pe{yi) ^ ^w'{{xi) , Pe {yn\yi:n-l) ^ X! "^ni^l-.n), U > 1. (13) 
xieX a:i:„eS„ 

Inevitably, for fixed TV, the quality of the particle approximation to the distribution pg (a;i:T|yi:T) decreases 
as T increases. For fixed T, once N is larger than the DPF computes pg (yi-.r) exactly. 

Before introducing the details of the new PMCMC algorithms, we review some existing MCMC algorithms 
for performing inference in SSSM. 



3.4 Standard Markov chain Monte Carlo methods 



Designing efficient MCMC algorithms to sample from p {0 , Xi-t , zo:T\yi:T) is a difficult task. Most existing 
MCMC methods approach this problem using some form of Gibbs sampler and can be summarized as 
cycling in some manner through the sequence of distributions p{0\yi;T,xi:T,ZQ:T)i Pe (-zotTlj/itT, a^iir) and 

Pe {xi:T\yi:T,Zo;T) Or pg (xi-.tIVi-.t) ■ 

Sampling efficiently from p [6\yi;T, xi;t, zq-t) is often feasible due to the small or moderate size of 6 and 
the fact that for many models and parameters of interest, conjugate priors are available. When conjugate 
priors are not used, Metropolis-within-Gibbs steps may be applied. 

A variety of efficient algorithms have been developed to sample from pg (zo:T|yi:T, xi-t)- These methods 
rely on the conditionally linear Gaussian structure of the model and involve some form of forward filtering 
backward sampling recursion (Carter and Kohn, 1994 Friihwirth-Schnatter, 19941. Variants of these schemes 



which approach the task by explicitly sampling the state disturbances may be more efficient and/or numer- 
ically stable for some classes of models (Be Jong and Shephard, 1995 Durbin and Koopman, 2002 1. In all 



the numerical examples we consider, sampling from pg (zo:t|?/1:T, 2;i:t) was performed using the simulation 
smoother of Durbin and Koopman (2002[ ) . 



Sampling from pg {xi;t yi-.T, ^0:t) can also be performed efficiently using a forward filtering backward 



sampling recursion ( [Carter and Kohn, 1994[ Chib, 1996p when {Xn} is a Markov chain. The resulting 
Gibbs sampler is elegant but it can mix very slowly as Xi-t and Zq-t are usually strongly correlated. To 



bypass this problem. Carter and Kohn (19961; Gerlach et al. (20001 proposed to integrate out Zq-t using 



the Kalman filter as discussed in Subsection |2.2[ However, as mentioned in the introduction, exact sampling 
from pg {xi-T\yi:T) is typically infeasible as the cost of computing this distribution is exponential in T. 
Therefore, in the algorithms of 'Carter and Kohn (1996| ; Gerlach et al. (2000), the discrete variables Xi:t 
are updated one-at-a-time according to their full conditional distributions pg {xn\yi.T,Xi-n-i,Xn+i:T)- It 
was shown in Carter and Kohn (1996); Gerlach et al. (2000[ ) that this strategy can improve performance 
drastically compared to algorithms where Xi-t is updated conditional upon Zq-^t- From hereon we refer to 
the Gibbs sampler of Gerlach et al. (2000[ ) as the "standard Gibbs" algorithm. 

At this stage, we comment a little further on the method of Gerlach et al. (2000[) as it is relevant to the 
new algorithms described in the later sections. The Gibbs sampler of Gerlach et al. (2000) achieves a sweep 
of samples from pg {xi\yi.T,X2:T), Pe (a^2|jyi:Ti 2:3.7-), etc. by a "backward-forward" procedure exploiting 
the identities 



{Xn\yi:T,Xi:n-l,Xn+l:T) pg{y 

n\yi:n—l-i ^l:n )pe{x )pg{yn+l:T\yi:n,Xl:T), (14) 



and 



Pe{yn+l:T\yi:n,Xl:T) = / Pe{y )pe{z 



(15) 
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In Gerlach et al. (2000[ ), it was shown that the coefficients of z„ in pe{yn+i:T\zn, x„+i:t) which are needed 



to evaluate (|f 5[) can be computed recursively for n = T,T — 1, 1 (the backward step). Then, for each n 
1,2, ...,T, pg{y 

n|yi:n— 1: ^l:n 

) are obtained through standard Kalman filtering recursions, 
( f 5 1 is computed for each Xn G X and a draw is made from ( 14 1 (the forward step). In the resulting algorithm, 
if the computational cost of evaluating pe(a;„|a;i:„_i,a;„-|-i:T) is 0{1), the cost of one sampling sweep through 
Pe ixi\yi:T,X2:T), Pe {x2\yi:T,xi,X3:T), ctc. grows linearly is 0{T). 



More recently, adaptive MCMC methods have been suggested to make one-at-a-time updates (Giordani 



and Kohn, 20081. However, these algorithms are still susceptible to slow mixing if the components of Xi-t 
are strongly correlated. Moreover even if we were able to sample efficiently using one-at-a-time updates, 
this algorithm might still converge slowly if Xi-x and 9 are strongly correlated; e.g. if {X^} is a Markov 



chain and 9 includes the transition matrix of this chain. Hammer and Tjelmeland (2011) have suggested 
an approximate, deterministic algorithm for forward filtering-backward smoothing in switching state space 
models, and the use of this method for making independent proposals as part of a Metropolis-Hastings 
scheme. By contrast, and as we shall see in the following section, the stochastic nature of the DPF algorithm 
allows the construction not only of exact Metropolis-Hastings-type algorithms, but also exact Particle Gibbs 
samplers. It is not clear how to achieve the latter using the deterministic forward-backward algorithm of 
Hammer and Tjelmeland (2011[ ). 



4 Discrete particle Markov chain Monte Carlo methods for switch- 
ing state-space models 

A natural idea arising from the previous section is to use the output pg {xi;T\yi:T) of the DPF algorithm as 
part of a proposal distribution for a MCMC algorithm targeting pg {xi;T\yi:T) or p [9, xi;T\yi:T)- This could 
allow us, in principle, to design automatically an efficient high-dimensional proposal for MCMC. However a 
direct application of this idea would require us to be able to both sample from and evaluate pointwise the 
unconditional distribution of a particle sampled from pg {xi;T\yi:T)- This distribution is given by 

qe {xi:T\yi:T) = E [pg {xi^tIVI-.t)] , 

where the expectation is with respect to the probability law of the DPF algorithm: the stochasticity which 
produces the random probability measure pg {xi;T\yi:T) in Eq- ([8|. While sampling from qg {xi;T\yi:T) is 
straightforward as it only requires running the DPF algorithm to obtain pg [xi-Tlyi-.r) then sampling from 
this random measure, the analytical expression for this distribution is clearly not available. 

The novel MCMC updates presented in this section, under the umbrella term discrete PMCMC, circum- 
vent this problem by considering target distributions on an extended space, over all the random variables 
of the DPF algorithm. Details of their theoretical validity are given in Subsection |4.3| but are not required 
for implementation of the algorithms. The key feature of these discrete PMCMC algorithms is that they are 
"exact approximations" to standard MCMC updates targeting p{9,Xi-T\yi:T)- More precisely, on the one 
hand these algorithms can be thought of as approximations to possibly "idealized" standard MH updates 
parametrized by the number N of particles used to construct the DPF approximation. On the other hand, 
under mild assumptions, discrete PMCMC algorithms are guaranteed to generate asymptotically (in the 
number of MCMC iterations used) samples from p {6, Xi;T\yi:T), for any fixed number N > 2 of particles, in 
other words, for virtually any degree of approximation. 



In Subsection 4.1 we describe the Particle MMH (Marginal Metropolis-Hastings) algorithm which can 
be thought of as an exact approximation of an idealised "Marginal MH" (MMH) targeting directly the 
marginal distribution p{9\yi;T) of p{9,xi;T\yi:T)- This algorithm admits a form similar to the PMMH 



discussed in Andrieu et al. (2010[ ) but its validity relies on different arguments. In Subsection 4.2 



we 



present a particle approximation of a Gibbs sampler targeting p (0, Xi-xlyi-.T), called the Particle Gibbs 
(PG) algorithm. It is a particle approximation of the "ideal" block Gibbs sampler which samples from 
p {9, xi;T\yi:T) by sampling iteratively from the full conditionals pg (xi-xlyi-.T) and p{9\yi:T,xi:T)- This 
algorithm is significantly different from the PG sampler presented in Andrieu et al. (2010) and incorporates 



a novel backward sampling mechanism. Convergence results for these algorithms are established in Subsection 

14:31 
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4.1 Particle marginal Metropolis-Hastings sampler 



Let us consider the following ideal "marginal" MH (MMH) algorithm to sample from p {9, xi-.tIhi-.t) where 9 
and xi-T are updated simultaneously using the proposal given by 

q{i9*,xl.,T)\ {9,X^..T))^q{9*\0)pg. {xlrlvi-.T) ■ 

In this scenario the proposed X^.j. is perfectly "adapted" to the proposed 6* and the resulting MH acceptance 
ratio is given by 

pi9*,xlr\yi:T)q{{0,xi..T)\{9*,xlT)) ^ pg. {y.-.r) p{9*)q{9\9*) 

p{0,x,.,T\yi:T) qiie*,xl.r)\{0,xi..T)) Peiyi-.r) p{e) q{e*\0) ' ^ ' 

This algorithm is equivalent to a MH update working directly on the marginal density p {9\yi.T), justifying 
the MMH terminology. This algorithm is appealing but typically cannot be implemented as the marginal 
likelihood terms pe {yi-.r) and po' [yi-.r) cannot be computed exactly and it is impossible to sample exactly 
from po* (xuIj/Iit)- We propose the following particle approximation of the MMH algorithm where, when- 
ever a sample from pe (xi;T\yi:T) and the expression for the marginal likelihood pg (yi-.x) are needed, their 
DPF approximation counterparts are used instead. 



PMMH sampler for SSSM 

Initialisation, i = 

• Set 9{0) arbitrarily. 

• Run the DPF targeting pe(o) {xi-.Tlyi-.T), sample Xi.,t (0) ~ pe{o) {-lyi-.r) and denote pe(o) (yi-.r) 
the marginal likelihood estimate. 

For iteration i > 1 

• Sample 9* r^q{-\9 {i - 1)). 

• Run the DPF targeting po* {xi;T\yi:T), sample X'^.j, ^ pg- (•|j/i:t) and denote pg- [yi-r) 
the marginal likelihood estimate. 

• With probability 

, Pe' {yi:T) P{6*) g(g(»-l)|r) 

(2/1:t) P{e{i-l)) q{9*\9{i-l)) 
set 9{i) = 9*, Xi.,T {i) = X*^,, (yi-.r) =pe' {yi-.r), 

otherwise set {i) = 9 {i - 1), Xi.,t (i) = Xi,t [i - 1), {yi-.r) = Pg{i-i) {yi-.r) ■ 



4.2 Particle Gibbs sampler 



As discussed in Section 3.4 an attractive but impractical strategy to sample from p{9,xi;T\yi:T) consists 
of using the Gibbs sampler which iterates sampling steps from pg (xiitI^iit) and p (0|?/i:t, s^Iit) or a mod- 
ified Gibbs sampler where we insert a sampling step from (^0:t|2/1:T, a^i:T) after having sampled from 
Pe {xi:T\yi:T) to update 9 according to p {9\yi;T, a;i:T, ^0:t)- Numerous implementations rely on the fact that 
sampling from the conditional density p{9\yi.TTXi;T) or p (0|yi:T, zo:t) is feasible and thus the poten- 
tially difficult design of a proposal density for 9 can be bypassed. However, as mentioned before, it is typically 
impossible to sample from pg {xi-T\yi:T)- Clearly substituting to the sampling step from pg (a;i:r|?/i:T)j sam- 
pling from the DPF approximation pg {xi.T\yi:T) would not provide Gibbs samplers admitting the correct 
invariant distribution. 

We now present a valid particle approximation of the Gibbs sampler which assumes we can sample from 
P (^|yi:Tj xi;t)- Similarly it is possible to build a valid particle approximation of the modified Gibbs sampler 
by the same arguments, but we omit the details here for brevity. 
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PG sampler for SSSM 



Initialisation, i = 

• Set 0{O),Xi.,T (0) arbitrarily. 
For iteration i > 1 

• Sample 6{i)^p {■\yi:T, ^i:t (i - !))■ 

• Run a conditional DPF algorithm targeting pg(i) (a;i:T|2/i:T) conditional upon Xx-t (i — 1) . 

• Run a backward sampling algorithm to obtain Xx-xii)- 



The remarkable property enjoyed by the PG algorithm is that under weak assumptions it generates samples 
from p{9iXi;T\yi:T) in steady state for any number N > 2 oi particles used to build the required DPF 
approximations. The non-standard steps of the PG sampler are the conditional DPF algorithm and backward 
sampling algorithms which we now describe. 

Given a value of 9 and a trajectory x\.rp, the conditional DPF algorithm proceeds as follows. 



Conditional DPF algorithm 

At time n = 1 

• Set Si= X and for each X (which includes x\), compute ■m^^^{x{) , S^j^(a;i) and g0{yi\xi) using the 
Kalman filter. 

• Compute and normalise the weights. For each xi e X, 

wt{xx) = ,ye{xi)ge{yi\xx), "^'^l] ,y (18) 

At times n = 2, ...,T 

• If |S„_i| < N set Cn-i = oo otherwise set C„_i to the unique solution of 

J2 lAC„_i<_i(xi;„_i)=Ar. 

Xl:„-ieSn-l 

• If W^_i (a;f:„_i) > 1/C„_i, maintain the Lri-i trajectories which have weights strictly superior to 1/C„_i 
(which includes xl.^^_i), then apply the stratified resampling mechanism to the other A'^|A'| — -L„_i weighted 
trajectories to yield N — survivors. Set S^_^ to the set of surviving and maintained trajectories. 

• If W^_i (a;*.„_i) < 1/C„_i maintain the Z/„_i trajectories which have weights strictly superior to 1/C„_i 
(which excludes xl.^_i), then apply the conditional stratified resampling mechanism to the other N \X\ — 
weighted trajectories to yield — survivors (which include xl.^). Set S'^_i to the set of surviving and 
maintained trajectories. 

• Set S„ = S;_i X X. 

• For each Xi.,n & S„, update and store m^'|^(a;i;„) and S^j^(a;i:„) and compute g0{yn\yi:n-i,xi:n) using the 
Kalman filter. 

• Compute and normalise the weights. For each xi;n € Af", 

wi (a;i:„) = fe{Xn\xl:n-l)g9{yn\yi:n-l,Xl:n) , ^ ^"~r}re^''7^^ ^ (19) 

<M = y y (20) 

• If backward sampling is to be used, store W^{xi;n), m^j^(a;i:„) and S^j^(xi:„) for each e S„. 
The conditional stratified resampling procedure can be implemented as follows. 
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Conditional stratified resampling 

• Normalise the weights w^_i {xi;n~i) of the | A"! — L„_i particles and label them according to the order of the 

corresponding a;i:„_i to obtain W^,j_i (^x^il-i^ij ', i — ^, N \X\ — Define k to be the integer satisfying 

(ft) ^ 

• Construct the corresponding cumulative distribution function: for i = 1, N \X\ — Ln-i, 

j<i 

• Sample U^, uniformly on — 1), (5^„i(k)] , set Ui — U^, — — " i)^*J compute Uj = 

Ui + j^jl^^ ^ for j — 2, .... N — Ln-i- Here [a\ denotes the largest integer not greater than a. 

• For i = 1, ...,7V lA"! - Ln-i, if there exists j e {1, ...,N - L„_i} such that Qi_i{i - 1) < Uj < Q^^_i{i), 
then x\.j^_^ survives. 

The backward sampling step is an important component of the PG algorithm. In contrast to the standard 



PMCMC algorithms of Andrieu et al. (2010), it allows the sampled trajectory obtained from the conditional 
SMC update not only to be chosen from those surviving at time T, but allows full exploration of all trajectories 
sampled during the Conditional DPF algorithm. Further comments on the theoretical validity of alternative 
schemes are made in section |4.3| and demonstration of numerical performance given in section [5j 

We note that this procedure is of some independent interest for smoothing in SSSM's if 6 is known, as 
it can be combined with the standard DPF algorithm. A forward filtering-backward smoothing algorithm 
for SSSM was devised in 'Fong et al. (2002[ ), and involved joint sampling of both continuous and discrete 
variables from an approximation of P9{xi:Tt zq-tIui-.t)- The backward sampling algorithm we propose is 
different because the continuous component of the state is integrated out analytically, giving a further Rao- 
Blackwellization over the scheme of Fon g et al. (2002| . Furthermore, the fact that the backward sampling 
algorithm involves sampling only discrete-valued variables is central to the validity of the PG algorithm, 
discussed in the next section. Details of the matrix-vector recursions necessary for the implementation of 
the backward sampling procedure are given in Appendix [B| 

Backward Sampling 

At time n^T 

Sample a path X^.j. from the distribution on St C defined by {W^{xi;t)}, then discard X'^.rp_^ to yield 



= X^. Set Et = 0, mt = 0. 
At times n = T - 1, 1 

• Update S„ and /x„ as per the procedure of Appendix [B| 

• For each xi-^ G S„ compute the backward weight 

) W^{Xl■n)pe{Xn+l,T\Xl:n)Pe{yn+l:T\yl■.n,Xl■.„,x'„_^_J^.T) 

where 

Pe{yn+l:T\yi:n,Xi;n,x'„^l.rp) = / Peiyn+l:T\Zn, x'„^i,T)Pe{Zn\xi:n, yi:n)dZn 



is evaluated using S„ and the stored m^'|^(xi:„) and I]^|^(xi:„) of pe(z„|xi:„, ?;i:„) as per Eq. ([28| in 
Appendix [B| 

• Normalise the backward weights {y,^ (a^im |^n+i T)}j; gs ^"^"^ draw from the distribution they define on 
S„ C A" to obtain X[.^^. 

• If n > 1 discard X'i.^_i, otherwise output X[.rp. 
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4.3 Validity of the algorithms 

The key to establishing the vaHdity of the PMCMC algorithms is in showing that these are standard MCMC 
algorithms on an extended state-space including all the random variables introduced in the DPF algorithm. 

The first step is to observe that under our representation of the DPF algorithm, its operation remains 
essentially unchanged if at each iteration we adopt the convention of setting wfj(a;i:„) = W,j (a^im) — for 
all xi;n ^ S„, and to replace all summations over S„ with summations over A"". We assume this convention 
throughout the remainder of this section, i.e. from now on, W,^ is a set of weights over X", but only those 
weights over S„ are non-zero. In this case the solution of J2xi es ^ ^ ^nW^ (xi-.n) — N is identical to 
the solution of J2xi eX" ^ ^ ^nW^ {xi-.n) — N . Furthermore we can consider the resampling mechanism 
as acting on all trajectories in X"^ and not only those in S„; those with zero weights clearly fall below the 
threshold 1/C„ and there is zero probability of them surviving the resampling operation. As we shall see, 
the intuitive implication of this observation is that once a trajectory xi^n has been discarded, it is lost and 
for any m > n , any subsequent trajectory {xi:n,x'„_^_i.^) € A"™ is also assigned zero weight. We denote by 
the set of normalised importance weights at time n, that is W^_]^ :— {W^^_i{xi;n-i),xi:n-i G A"^^} 

We next write an expression for the joint distribution of the sequence of random support sets Si, S2, 87^ 
generated through the DPF algorithm. By definition of the algorithm, for rt > 2, S„ is conditionally 
independent of the history of the algorithm given Wfj_i 

S„l(W«-i=w^O-^n(-|w:_i), (21) 

where for each N, n and wfj„i, i'l'^n^i) can be understood as a probability distribution over the set 
of subsets of X", and we denote this set of subsets by 'P{X^). This distribution is parameterized by N 
because for all n > 2, for each point s„ in the support of {■[^n-i) i \^n\ = N\X\ . In the case of n = 1, 
r^{.)=l[. = X]. 



We will not need an explicit expression for the distribution (21 1, but from the definition of the optimal 



resampling mechanism (Fearnhead, 1998 Fearnhead and Clifford, 20031, we know that it has the following 



marginal property: for all Xi-^i & X^ , we have 

ixi:n G s„|w^_i) = 1 A C„_i<_i (Xl:„„l) . (22) 

where we have adopted the abusive notation that 



Eq. ([221 implies 

{Xi:n G S„|u;^_i {Xi:n-l) = 0) = 0. 



Combined with Eq. (21 1 we see that for any n and xi-n-i, conditional on the event that W^_i {xi;n-i) = 0, 
any subsequent paths wliich have xi.„_i as their first n—1 coordinates are also assigned zero weight and are 
not members of any subsequent S„ . Thus the corresponding subsequent weights need never be computed or 
stored, as required to control the cost of the algorithm. We thus have the property as claimed earlier that 
once a trajectory is discarded it is not recovered. To summarize the law of the DPF algorithm, we can write 
the density of Si, 83, St on HLi ^iX"-) as 

T 

V',^(si,S2,...,ST)=rf(si)n''^(Sn|w:_i). (23) 

n=2 

As the weights are just a deterministic function of Si, . . . , S„, it is not necessary to introduce them as 
arguments of . 

The key to the PMCMC algorithms described here is to define the following artificial target density on 
e X A"^ X Hlzl V{X") through 

TT^ie, Xi:T, Si, S2, St) - p( 9, X^..t\ Vi-.t) I n e 1 ^T '^\S!''^''""^r\ ^ (^4) 

l„=2 J n„=2^^(^l:"GS„|w«_i) 

which admits p{0 ,xi-T\yi:T) as a marginal by construction. Let 7r^(a;i:T, Si, S2, st) denote the density 
of Xi:T, Si, S2, St conditional upon under 7r^(6', a;i:T, Si, S2, st). In the following results we show 
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that the PMMH and PG algorithms are just standard MCMC updates targeting this artificial distribution. 
Proofs can be found in Appendix [C| 

We first present a result establishing the convergence of the PMMH sampler which relies on the following 
assumption. 

(Al) The MH sampler of target density p {6\ yi-.r) and proposal density q(6*\9) is irreducible and aperiodic 
(and hence converges for almost all starting points). 

We have the following result. 

Theorem 1 For any N > 2 

1. the PMMH sampler is an MH sampler defined o n th e extended space e X A-^ X nLi ^('^") with target 
density tt^ {6,Xi;T,Si,S2, ■■■,Sx) defined in Eq. {24) and proposal density 

q{9*\e) < {xI,t) ^9' (s*i,s;,...,s5,) (25) 

where (xI-t) the realisation of the normalised importance weight associated to the population of 
particles proposed by the DPF algorithm. 

2. if additionally (AW holds, the PMMH sampler generates a sequence {9 (i) , Xi-t (i)} whose marginal 
distributions {C^^{9 (i) , Xi-t (i)) G ')} satisfy 

\\£^ {{9{i),Xi.,T{i)) e ■)-p{-,-\yi:T)\\j,y^O asi^^ . 

for almost all starting points. 



Next we consider the backward sampling procedure and establish its invariance properties. 

Proposition 1 For any N >2 and 9 E Q, assume (A'l:^, Si, S2, Sy) is distributed according to tt^ {■) 
and let X[.j, be the trajectory obtained at any time step m of the backward sampling procedure operating on 
(Xi;T, Si, S2, St)- Then X[.rp is distributed according to pg (xi-.tIvi-.t)- 

We now state a sufficient condition for the convergence of the PG sampler and provide a simple conver- 
gence result. 

(A2) The Gibbs sampler defined by drawing alternately from the conditionals p {9\yi;T, xi-.t) and pg (xi:t|?/1:t) 
is irreducible and aperiodic (and hence converges for p-almost all starting points). 

We have the following result. 

Theorem 2 

1. steps 1 — 4 0/ the PG update define a transition kernel on t he e xtended space Q x A"^ x 11^=1 ^('^") 
of invariant density tt^ {9, xiit, Si, S2, Sy) defined in Eq. { 24 ) for any N > 2. 



2. if additionally holds, the PG sampler generates a sequence {9 (i) ,Xi-t («)} whose marginal dis- 

tributions {CpQ {{9 (i) , Xi-T (?)) G •)} satisfy for any N > 2 

\\C^GiiHi).Xi.,T (z))e •)-p(t|2/1:t)||,„^0 asz^cx), 

for almost all starting points. 

Remark 1 The reader will observe that as Proposition^ applies for any time step of the backward sampling, 
modification of the PG algorithm to the case where Xi-^t (i) is set to the X[.rp obtained at any time step of 
the backward sampling procedure also corresponds to a Markov kernel of the required invariant distribution. 
For example, one could simply apply only the first backward sampling step: sample X[.j, from the distribution 
defined by {Wt{xi;t)} o,nd then set Xi-^t (?) = X'-^.j.. The resulting algorithm is closer akin to the original 
Particle Gibbs algorithm of Andrieu et al. (201(^ . However, in numerical experiments in the context of 



SSSMs this approach has been found to be relatively inefficient. This phenomenon is discussed further and 
demonstrated numerically in section^ 
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5 Applications 



5.1 Example 1: Autoregression with shifting level 

In our first numerical experiments we return to the toy model specified in section |2.1.1| and address some 
generic issues regarding algorithmic settings and performance. 



5.1.1 Particle Gibbs and the effect of backward sampling 

We first demonstrate the effect of applying the backward sampling procedure as part of the PG algorithm. 
The purpose of this section is to show the importance of applying backward sampling as part of the PG 
algorithm and to show its advantage over the standard Gibbs sampler. From hereon we refer to as "PG 
without backward sampling" the alternative PG scheme described in Remark [T| which involves sampling 
X[.rj, from the distribution defined by {Wt{xi-t)} and immediately setting Xi-x (*) — X'-^-t- 

Recall that for this model the parameters are 6* = [</> cr^ Px]- Conjugate priors are readily available: a 
Gaussian distribution for 4>, inverse-gamma for and independent Dirichlet for each row of Px- A data 
record of length T = 1000 was generated from the model with true parameter values of 0.1, a = 0.1 and 

0.99 0.01 Y\a.t Dirichlet priors were set on each row of Px- A A/'(0, 10) distribution restricted 



P 



X 



0.99 0.01 



to < 1 was set over and a (0.1, 0.1) inverse gamma distribution was set over a^- The initial distribution 
over /ip was A/'(0, 10). For various numbers of particles the PG algorithm, with and without backward 
sampling, was run and compared to the standard one-at-a-time Gibbs algorithm in terms of the sample lag 
1 autocorrelation for each component of the discrete latent trajectory {X„}. In all cases the simulation 



smoother of Durbin and Koopman (20021 was used to sample from pe{zo:T\yi:T, xi;t)- 

In both panes of Figure [1] the vertical dashed lines show the true times at which X„ = 1. The bottom 
pane shows the lag 1 autocorrelation for PG with backward sampling and the standard one-at-a-time Gibbs 
sampler: here it was found that for all components of the trajectory, increasing N monotonically decreased 
the autocorrelation and for any N the PG algorithm exhibited lower autocorrelation than the standard 
one-at-a-time algorithm. Spikes in the autocorrelation coincide with the true times at which X„ = 1 and 
between these times the autocorrelation, even using the standard Gibbs sampler, was found to be very low. 
By contrast, for the PG without backward sampling and the same numbers of particles, the autocorrelation 
from the PG algorithm was higher than that from the standard Gibbs algorithm for most components of 
the discrete trajectory. In all cases the sample autocorrelation was computed from 10^ iterations after a 
burn-in of 10^ iterations. After the lO'' -I- 10^ iterations, with = 10 and iV = 20 particles, the PG without 
backward sampling had entirely failed to converge: in the plots of Figure [T] we use the ranges in which 
the plots reach the value exactly 1 to represent those components of the discrete trajectory never having 
changed from their initial condition (such a sample sequence does not have a well defined autocorrelation as 
its sample variance is zero). Very similar results were observed for other initialisations and data records. 

This performance can be explained in terms of the well-known particle path degeneracy phenomenon 
which arises from the resampling mechanism in SMC algorithms: the act of repeated selection of sampled 
paths inevitably leads to a loss in diversity in their early components. In the present context the path 
degeneracy influences the performance of the PG algorithms via the conditional DPF update. During the 
conditional DPF operation at MCMC iteration i -I- 1, by construction of the conditional DPF, Xi-t («) is 
forced to survive until time step T- Thus for the PG without backward sampling, for some m < T, the path 
degeneracy phenomenon implies there is a significant probability that Xi-m (*) coincides with Xi;„i{i + 1). 
This explains the strong correlations between components of consecutive samples of the latent trajectory 
shown in the top pane of Figure [l] By contrast, backward sampling provides a chance for the path degeneracy 
to be circumvented. The CPU time for one iteration of the PG with backward sampling was found to be 
between 1 and 1.5 times that without backward sampling for the same number of particles. The results 
therefore indicate that overall it is significantly more efficient to use the backward sampling method and 
from now on it is the only PG algorithm we consider. 

Figure [2] shows sample autocorrelation as a function of lag for various numbers of particles from the PG 
algorithm with backward sampling and the standard one-at-a-time Gibbs sampler. We observe that using 
large N leads to lower autocorrelation and very little decrease in autocorrelation was observed using more 
than = 50 particles. As we go on to discuss in more details in the next section, under the Dirichlet 
prior for each row of Px it is possible to analytically integrate out Px both when using the standard Gibbs 
sampler and the PG, and we did so. The above experiments were also conducted in the case where Px is 
not integrated out and we obtained results which were almost identical (not shown). 
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Figure 1: Example 1. Sample lag-1 autocorrela- 
tion for each of the discrete trajectory components 
{Xn{i), n = 1, 1000} with (bottom) and without 
(top) backward sampling for various numbers of par- 
ticles: □ : TV = 10; x: iV = 20; O: ^ = 50; In both 
top and bottom * is sample autocorrelation for stan- 
dard one-at-a-time Gibbs. Vertical dashed lines are 
true locations of Xn = 1. 
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Figure 2: Example 1. Autocorrelation against lag 
for standard Gibbs sampler (*) and PG with back- 
ward sampling and various numbers of particles: □: 
TV = 10; X : iV = 20; O: ^ = 50. Top pane is for 
and bottom pane for a^. 



5.1.2 Treatment of Px 

A common feature of SSSMs is that it is possible to analytically integrate out Px under Dirichlet priors for 
each of its rows and the autoregressive model with shifting level is no exception. It is natural to ask, even in 
the context of standard MCMC algorithms, whether it is beneficial to perform this integration analytically, 
or to treat Px as part of the sampling problem. To the authors' knowledge, in the context of SSSMs this 
issue has not been treated in the literature. 



Consider first the standard one-at-a-time Gibbs sampling case. The reader will recall from section 3.4 



and Gerlach et al. (2000 1 that the algorithm involves sampling from 



)P0{yn+l:T\yi:n,Xl:T) (26) 

for each n. Conditionally on Px, the process {-'i^n}„>i is Markov and so in the above display we have 
the simplification pg{xn\xi:n-i,Xn+i:T) = Peixn\xn-i,Xn+i)- Convcrscly, when Px is integrated out, in 
which case the parameter reduces to = [</> cr^], the process {^n}„>i is not Markov and the former 
simplification is not applicable. Thus, in terms of the correlation structure of the Markov chains generated 
by the corresponding Gibbs samplers, there appears to be a trade-off between conditioning on Px and 



conditioning on components of the {^ti}„>i process when drawing from distributions of the form (26 I. In 
terms of computational cost there is no significant difference: in the case that Px is integrated out analyacally 
evaluation of pg(xn\xi;„^i, Xn+i-.x) requires only state-transition count statistics which are cheap to compute 
and store. 

Analogous remarks to those above hold for the PG algorithm. It involves computing fe{xn\xi;n-i) in the 
conditional DPF step and peixn+i:T\xi:n) in the backward sampling step and it is in these places that the 
same conditioning issues arise. In our numerical experiments for this model and others we were unable to 
establish that either incorporating Px into the sampling problem or integrating it out analytically lead to a 
significant advantage in terms of sample autocorrelation, both for the standard one-at-a-time Gibbs sampler 
and the PG algorithm (results not shown). It would be very interesting to study the theoretical properties 
underlying this issue in Gibbs sampling algorithms for SSSMs but such an investigation is well beyond the 
scope of this article. 

We found more obvious effects in the context of the PMMH algorithm, which we now go on to discuss. 
In this case fe{xn\xi;n-i) is computed as part of the DPF algorithm, which is where the same conditioning 
issues arise. A data record of length T = 1000 was generated from the model with the same true parameter 
values as stated in the previous section. The same prior distributions were also employed. Central to the 
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performance of the PMMH algorithm is the normahzing constant estimate peiui-.r) computed using the 
DPF. When the variance of this estimate is large the PMMH algorithm performs poorly, exhibiting a high 
rejection rate - a characteristic shared with the standard PMCMC algorithms in Andrieu et al. (2010[ ). We 
found that in the two cases (where Px was integrated out and where it was not) , the DPF exhibited striking 
differences in the variance of this estimate. The parameter 9 was set to its true value and the DPF was run 
1000 times on the simulated data set. Figure [s] shows the sample variance of logpe(2/i:n) as a function of 
n. The bottom pane corresponds to the case in which Px is integrated out analytically. In this case the 
sample variance grows super-linearly with n. By contrast, as shown in the top pane, when conditioning on 
Px the variance grows far more slowly. Very similar results were obtained when conditioning on values of 
Px other than the truth. A step towards explaining this phenomenon is noting that integrating out Px 
destroys the ergodicity properties of the latent process {Xn} conditional on 9. For standard SMC algorithms 
it is now theoretically well understood that assumptions about the ergodicity properties of the latent process 
are central to establishing linear growth rates (with respect to n) for the error in normalizing constant-type 
estimates ( Cerou et al., 20l"o| . Our numerical results are consistent with the DPF having similar properties. 

The variance of ?J6»(j/i:t) influences the acceptance rates of the corresponding two PMMH algorithms. Of 
course the trade-off is that when implementing a PMMH algorithm which incorporates Px into the sampling 
problem one has the added burden of designing proposal moves for Px and the contribution to the variability 
of the MH acceptance ratio from these proposals also influences the acceptance rate. In our experiments we 
found that an effective approach to making proposals for Px was to reparameterize the model in terms of the 
unnormalized components of each row of Px , with the Dirichlet prior corresponding to gamma priors over 
these components. Proposals could then be made using log-Gaussian random walks (an analogous approach 
was advocated in the Jasra et al. (2005) in the context of static mixture models). In numerical experiments 
we adopted this approach with independent log-Gaussian random walk proposals made on each unnormalized 
component of Px- After a couple of preliminary runs, the standard deviation of the increment in the log 
domain was set to 0.05. A log-Gaussian random walk proposal with the same standard deviation was also 
used for the parameter and a Gaussian random walk with standard deviation 0.1 was used for (p. For 
the case where Px is integrated out we used the same proposals as above for (j) and a^. Figure |4] shows the 
PMMH acceptance rates as a function of the length of the data record. These results were obtained over 10^ 
iterations of the algorithms after a burn-in of 10^. The results show that the acceptance rate drops much 
more rapidly in the case that Px is integrated out. However, we cannot conclude that the PMMH algorithm 
is always more efficient when Px is incorporated into the sampling problem as the overall efficiency naturally 
depends on the particular choice of proposal mechanism for Px ■ Our numerical results do indicate that even 
using a fairly simple proposal mechanism for Px one can obtain acceptance rates which are superior to those 
in the case that Px is integrated out analytically and the autocorrelation plots in Figure [5] show that this is 
carried over to lower sample autocorrelation for the parameters (j) and a^. 



5.2 



Example 2: 
ments. 



Multiple change-point model with dependence between seg- 



There is an extensive literature on statistical time series analysis based on multiple change-point models. In 
such models it is often assumed that given the position of a change-point, the data after that change-point 



are conditionally independent of those before, see for example Barry and Hartigan (1993); Fearnhead and 
Liu (2007), amongst many others. This modelling assumption may be restrictive in some circumstances. 
A natural way to relax it is via a SSSM, which allows the notion of change-points to be introduced whilst 
allowing potentially complex dependence structures across segments of the data. 

We consider a multiple-change point model in which observations arise from a latent process which is 
piece-wise linear. Changes in the latent process are of two varieties: those in which there is a discontinuity in 
the latent trajectory and its gradient and those in which there is a discontinuity only in the gradient. More 
speciflcally, we have X = {0, 1, 2} and we assume that {X„} is Markov with unknown transition matrix Px- 
The observations are real- valued, as are the latent trajectory and its gradient {/i„}. In state-space 
form we have 
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Figure 3: Example 1. Sample variance of logpe(j/i: ,i ) 
for fixed ^ as a function of n for various numbers of 
particles: *: AT = 10; □: A'' = 50; x: N = 100; Q: 
N = 200. Top pane is conditional on the true value 
of Px and bottom pane is with Px integrated out. 



Figure 4: Example 1. PMMH acceptance rate as 
function of data record length for various numbers 
of particles: blue: N = 10, green: A'' = 50, red: 
N = 100, black: N = 200. Top pane is PMMH 
making proposals for Px and bottom pane is with 
Px integrated out. 




Figure 5: Example 1. Sample autocorrelation against lag for various numbers of particles, *: N = 10; □: 

iV = 50, x: TV = 100: O: N ^ 200. Left column is PMMH with Px sampled and right column is with Px 
integrated out analytically. Top plots are for </> and bottom plots are for ct^. 
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Figure 6: Example 2. Top: well-log data. Middle and bottom panes are estimated p{Xn = 2\yi-T) from 
respectively the standard Gibbs and PG samplers. 



Here A is a fixed time incremement and the unknown parameters are 9 = [uy cr^ o '^^p.i Px\- We apply 
this model to the analysis of well-log data: measurements of the nuclear resonance of underground rocks, 
as studied originally in O Ruanaidh and Fitzgerald (1996] ). Observations arise from a drill bit which passes 
down through layers of rock over time and each datum is a measurement of the resonance of the rock through 
which the bit is passing at that time. The aim is to identify segments in the data, each corresponding to 
a stratum of a single type of rock. The data set we analyse was treated in Fearnhead and Clifford (2003| ; 
Fearnhead and Liu (2007, [2010 1 under a variety of models, but in all these cases the static parameters 
of the models were assumed known. In Fearnhead and Liu (2010 ) a change-point model with dependence 



across segments was employed and its advantages in terms of avoiding spurious detection of change-points 
was demonstrated. We are interested in similar analysis, but without assuming fixed values for the static 



parameters of the model. As in Fearnhead and Clifford (2003 1; [Fearnhead and Liu (2007 2010 ) a few extreme 
outliers were removed from the data set manually resulting in 3975 data points. 

Flat Dirichlet priors were set on each row of Px- Independent inverse gamma (2,3) priors were placed 
over cTy, cr^ q and cr^ j^. In our experiments, inference was found to be insensitive to choice of parameters 
for these inverse gamma priors (not shown). For the initial distribution over Zq we set a relatively diffuse, 
zero mean Gaussian prior with diagonal covariance components 100 and 100, corresponding to /iq and /ig 
respectively. We set A = 0.1. The standard Gibbs sampler was run for 2 x 10® iterations and PG sampler 
with TV = 50 for 4 X lO** iterations so as to equate computational cost. Histograms of sample output for Cy, 
(T^ Q and cr^ I are shown in Figure jT] These results indicate that despite the long run the standard Gibbs 
sampler has not converged: most noticeably in the case of the histograms for cr^ j^, it appears not to have 
explored the support as thoroughly as the PG sampler and has become stuck in a mode of the distribution. 
The difference in performance is even more striking when considering the corresponding estimated posterior 
probabilities for the latent switching process. Figure [6] shows the estimated marginal posterior probabilities 
of each X„ being in state 2 (recall this state corresponds to a discontinuity in the latent process {/i„} and its 
gradient) for each time step of the data record. Due to the lack of full exploration of the parameter space, 
the results for the standard Gibbs sampler show erroneously high posterior probabilities that each Xn is 
in state 2. We can conclude that for the same computational cost the performance of the PG sampler is 
superior. 



5.3 Example 3: Exchange Rate Model 

The following model was investigated in Engle and Kim (1999[ ); Friihwirth-Schnatter (200"6| , where it was 
used to analyze economic data. The model consists of a latent random walk component observed in auto- 
regressive noise, where the variance of the observation noise innovations can switch between different values. 



In Engle and Kim (19991, this model was advocated to reflect the heteroscedasticity evident in the price 
index adjusted U.S. /U.K. exchange rate during the late 19th and 20th centuries. The data consist of 1322 



monthly log exchange rate values. We consider the case treated in Friihwirth-Schnatter (2006) where the 
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Figure 7: Example 2. Histograms estimates of posterior marginals. Top row: standard Gibbs sampler. 
Bottom row: PG sampler. Columns from left to right are cr^ i, af^Q and ay- 



auto-regressive noise process is of order 2 and there are 4 switching states. In this model, X = {1, 2, 3, 4} and 
the discrete latent process {X„} is a Markov chain with transition matrix Px- The observations {¥„} are log- 
exchange rate values. The latent process {fin} is a random walk and we denote by {?7„} the auto-regressive 
noise process: 

fl-n ^" ^ 

Vn = ainn-i + a2?7„-2 + a-^,x„ V'„,2 
where and {K1.2} are i.i.d. 7V(0, 1) noise sequences. In state-space form we then have 
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Ce{xn) = [ 1 1 ] , De[xn) = Fe{xn) = Ge[xn) = 0, Vx„. 

The unknown parameters of the model are 9 = [cr^ 2 '^'^3 '^^4 '^1 '^2 Px]- Under symmetric 

priors the labeling of the discrete states is not identifiable. We consider the same prior distributions on 
the parameters and initial conditions on Za as in Friihwirth-Schnatter (200"6| and we refer to the latter for 
full details, including a stability constraint on the auto-regressive coefficients (01^02). The only difference 
is that we do not impose an identifiability constraint a priori on cr,^ , cr^ 2 ? '''^ 3 7 "'^ 4 1 but instead target 



the unidentified model and impose the ordering cr,^ 



n,3 



< a. 



after sampling (see 



Friihwirth- 



Schnatter (2001); Jasra et al. (20051 and references therein for various approaches to drawing inference in 



models with unidentifiable state labels). 

We implemented an algorithm for this model with Px incorporated into the sampling. Each iteration of 
the algorithm consisted of a sequence of two PMMH updates. The first holding Px and c,^ 1 , 2 ' "'^ 3 1 4 
constant and the second holding (ai 02) and constant (using standard arguments for Metropolis-within- 
Gibbs algorithms and Theorem [T] it is straightforward to show this sequence of updates is invariant with 
respect to the extended target distribution). After a couple of preliminary runs the following proposals 
were selected. A symmetric random walk proposal of standard deviation of 0.001 was used for (01^02) and 
for cr^ a log-Gaussian random walk with log-domain standard deviation of 0.01. We used a mixture of log- 
Gaussian random walks for the unnormalised components of Px and af^ ii "'^ 2' '^^ 3' '''^ 4- For each individual 
parameter, the mixture had two components, the first with weight 0.9 and standard deviation 0.05 in the 
log domain and the second with weight 0.1 and standard deviation 1 in the log domain. With these settings 
and N = 200 we achieved an overall acceptance rate of 0.2. This is a reasonable rate given the mixture 
proposals. The algorithm was run for 2 x 10^ iterations after an initial burn-in of 10**. Inferential summaries 
are presented in Figures [8|T0 We note that there are some differences between the results we obtained and 
those from Friihwirth-Schnatter (200"6| , where a standard Gibbs sampler was applied. We conjecture that 
the latter had not fully explored the support of the posterior distribution. Noticeable differences are that the 



posterior marginal for ct^ i we obtain is more diffuse than that reported in Friihwirth-Schnatter (2006 1 and we 



obtain a much ffatter trajectory in the posterior estimates of in Figure TU Another significant difference 
is that we obtain concentration of the marginal posterior over the auto-regressive coefficients (01,02) in a 



19 




Figure 8: Example 3. Histogram estimates of posterior marginals and scatter plots of pairwise marginals for 
the exchange rate model. 



different region than that reported in Friihwirth-Schnatter (2006| . Using other proposals for (ai, 02) we were 
not able to find another major mode. Furthermore the posterior marginal for cr^ we obtained is concentrated 
on lower values. Overall, we feel that the ability to integrate out approximately the latent variables makes 
the PMMH algorithm a powerful tool: as these results demonstrate it gives us the chance to explore regions 
of posterior support which Gibbs sampling algorithms may struggle to find. 



6 Discussion and extensions 

In this article, we have proposed new PMCMC algorithms relying on the efficient DPF algorithm to perform 
Bayesian inference in SSSM. We have shown experimentally that these generic discrete PMCMC algorithms 
outperform current state-of-the-art MCMC techniques for a given computational complexity. Moreover the 
DPF can be easily parallelised so further substantial improvements could be obtained. 

There are various possible extensions to this work. First, we have restricted ourselves to SSSM but the 
DPF can be applied to any model where the latent process is discrete-valued. This includes for example 



Dirichlet process mixtures (Fearnhead, 20041 and the infinite hidden Markov model introduced in Teh et al 
|(2006[ ). Compared to the SSSM framework, the differences are that, in these scenarios, X„ takes values in a 
set whose cardinality increases over time and computations required to evaluate the importance weights are 
not performed using the Kalman filter. However, the discrete PMCMC methodology discussed here can be 
straightforwardly extended to these cases. Second, it would be possible to extend the DPF and the associated 
discrete PMCMC methodology by using look-ahead techniques. In a look-ahead strategy with an integer 
lag L, we resample trajectories at time n by considering the weights proportional to pg {xi;n\yi:n+L) instead 
of pe {xi-n\yi:n) for the standard DPF. This is obviously more expensive than the DPF as computation of 
the weights involves summing over a;„+i:i for each particle, but this might be of interest in scenarios where 
future observations are very informative about X„. 
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Figure 9: Example 3. Histogram estimates of marginal posterior distributions for entries of the state tran- 
sition matrix Px- Panes are arranged as per the transition matrix itself. 
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Figure 10: Example 3. Top left: data (solid) and E [^„| yi-.r] (dashed). Bottom left: E [ct^ ^ | yi-.r] ■ Right: 
estimated posterior probabilities p{Xn ~ d\v\:T) for, top to bottom, j = 1, 2, 3, 4. 
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A Kalman Filter 



Conditional upon Xi-^ = xi-t-, Eq. (|2|-([3| defines a linear Gaussian state-space model. The Kalman 
filter allows us to compute recursively in time pe (^nl yi:n-i, a;i:n) = ^f {zn, "m^nin-i i^i-n) 7 ^^f„_i (a^im)) , 
P9 {zn\ yi:n, Xi;n) = M [zn] (xi:„) , S^^j^^ {xi-.n)^ and the predictive density 

99 {Vn\ a;i:„) = TV [vn, m^n\n~i i^i-n) , ^^f„_i (a^iin)) • For n > 1 these statistics are computed using 
the following recursion initialized with m^^^^ = rriQ, = Eg 

S^„^„_l (a;i:„) = Ae{XnWn_i\r,-l^l i^n) + Be{Xn)Bj {Xn), 

(^1:") = '^e{xn)m''^^^_^ {xi,„) + Ge(a;„)u„, 



B Backward Sampling 

A key component of the backward sampling algorithm is the evaluation of the backward weight 

)Peiyn+l:T\yi:n, Xi-.n, x',^_^^-^.rp) 

for each candidate sub-trajectory xi^n and where x'^_^i.rp is the complementing sub-trajectory which has been 
obtained from previous steps of the backward sampling procedure. Central to the computation of this weight 
is the identity 

Pe{yn+l:T\yi:n,Xi..n,x[^^j^.rp) = j pe{y )p0{Zn\xi;myi:n)dZm (27) 

where pe{z.n\xi:myi:n) is the Gaussian conditional filtering density associated with the sub-trajectory xi;n 
and is specified by its mean vector {xi-.n) and co-variance matrix S^^^ {xi-.n)- In order to com- 

pute ( |27[ ) (at least up to a constant of proportionality) it is necessary to obtain the coefficients of z„ in 
Pd{yn+i:T\zn, x'^_^_^.rp) . The latter can be expressed as 

Pe{y ) cx exp 

where S„ and /i„ are respectively a matrix and vector of appropriate dimension, both depending on x'^^i.rp, 
Hn+i-.T and 9. In the following this dependence is suppressed from the notation for convenience. For ease 
of presentation we use the similarly abusive conventions in writing to„ = {xi:n)i S„ = (xi-.n), 

An = A0{Xn), Br, = [Be{Xn) Qzxw], = Cg{Xn), £>„ = [Oyxv Dg{Xn)] , F„ = Fg{Xn), G„ = Gg{Xn). Then 

let T„ be a matrix satisfying S'^f^ (xi-.n) = ^n'^n- We have 



' 2 \ n ^nZn 



Pe{yn+l:T\yi:n,Xl:n,x'n^l.rp) 

1 r 



(X exp 



mf^^nmn - 2/1^ m„ - (/x„ - ^nrrinY T„ (T„^„T„ + /) ^ (/z„ - Ennin) 



-1/2 



(28) 



where / is the identity matrix of appropriate dimension. We now specify equations for updating (/z„,S„), 
which are given without proof of validity: they are a direct application of Lemmata 1 and 2 in |Gerlach] 



et al. (2000 1 . As in Gerlach et al. (2000 1 , for simplicity we present recursions only for the case in which the 
observations are scalar- valued, but they can readily be extended to the vector-valued case. Let 



^n+l — {Cn+lBn+l + -Dn+l) (Cn+1-B„+1 + -Dn+l) , 

= (l - ^n+lCn+l) ^n+l, 
O-n+l = (l ^ ^n+lCn+l) ^n+lUn+l — ^n+lG'n+lUn+l, 

and let r„_|_i be a matrix which satisfies 



The recursion for (/i„, is then given by 

• Set = 0, — 0. 

• For n = T - 



Mn+i 



r^+i^„+ir„+i -I- /, 



fJ-n 



+Aj+iC'^.l (Un+l — Gn+lUn+1 — Cri+1 ) ■ 

r„+i 



' n+1 



C Proofs 



Proof of Theorem [T} We obtain from Eq. ([22])-(|23|-([24]) that on the event xi,t £ St, 

Pe{xi:T\yi:T) {nl=2^[2^1:« ^ ^"]} 

~ wi.ix,..T)ulzla^Cnw?,{x,..n))' 



It follows from Eq. ( 10 1-( 11 ) that on the event xi-^t & the normalized weight can be expanded as follows 

T 

Wt {xi:t) = i^e{xi)g0{yi\xi) /e( )9e{yn\yi:,i-l,Xl:n) 



n=2 



T-1 



(29) 



Hence, using Eq. (12l-(13), we obtain 

7rf (xi:T,Si,S2, ...,St) 



^ Pe (vi-.t) 

(a;i:T) V'^ (si,S2, ...,st) Pe{yi:T)' 



(30) 



From ( 30 1 , we can now easily establish that an MH sampler of target density ( 24 1 and proposal density ( 25 1 



admits indeed Eq. ( |17[ ) as MH ratio and the first part of the theorem follows. The second part of the proof 
is a direct consequence of Theorem 1 in Andrieu and Roberts (2006| a nd ■ B 

Proof of Proposition [T] First note that, from Eq. (24l and Eq. (29l, for 6',Si,S2, ...,st in the support of 
7r^(6',Si,S2, ...,st). 
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7r^(xi:T|6', Si,S2, ...,st) oc pe{xi;T\yi:T 



^ ^ il V'^(si, S2, St) 

< I I l[xi:n e S„ S — — 

[n=2 J nL2^^(2^1:nes„|w«_i) 



) 

l)ge{yn\yi:n-l,Xl:n) 



(31) 



(X Wrp{Xl;T)- 

Furthermore, for 1 < n < T — 1, 



\n \ / \ ^ 1 TT irr ^ il §2, S„) 
TT (a:i:„|6',X„+i:T,Si,S2, ...,S„) CX PS[Xl:T\yi:T) < [[HX^k & Sk\ > ^fn m r~0 



Peixi:n\yi:n) , \ \ I | A 

» ~iv7I |...fl :^Pe[Xn+l:T\Xl:n)Pe{yn+l:T\yi:n, Xi-.T) 



111=2 ^kixi-.k e Sfe|w»_J 



X 



.fe=2 



CX W^{xi.,n)pe{Xn+l:T\xi:n)Peiyn+l:T\yi:n,Xl:T) 

CX (a;i:„ |a;„+i:T) , (32) 



where for the third proportionaHty we have used (19l-(20l and an expansion of W^{xi;n) which is the direct 



analogue of (29) but for final time index n. 
To estabhsh the assertion of the proposition we use an inductive argument over the iterations of the back- 
ward samphng algorithm (indexed by n = T, T — The inductive hypothesis is that for some in- 
dex n satisfying l<m<n<Tof the backward sampling procedure, (XJ.-^, Si, S2, S„) obtained 
immediately after sampling from the backwards weights is distributed according to the marginal distri- 
bution TT^ {xi.T,Si,S2, ■■■,st). This implies (X^.^, Si, S2, s„_i) is distributed according to 
Exi.„„i Es„ ■■■Est '^^^(^i^T'Si'Sa, •■•,St)- Then at time step n - 1, due to Eq. ([32|), (X^^^,, Si, S2, s„_i) 



obtained after sampling from the backward weights is distributed according to J2s ■■■ X^sr ""e (^1:T, Si, S2, 
and thus X[.rp is distributed according to ... X^st '^e' i^i--Ti Si, S2, Sx) = Pe{xi:T\yi:T)- Next note that, 
due to Eq. (31 1 the first step of the backward sampling procedure draws from tt^ (a;i:T|si, S2, ...,st). The 



proof is then complete under the assumption of the proposition. ■ 

Proof of Theorem |2] For part 1, it is easy to check that steps 1 — 4 of the PG algorithm define a collapsed 
Gibbs sampler targeting Eq. ( 24 1 . This follows from Proposition [l] and the fact that the conditional DPF 



update, given a value of 6 and xi;t, is nothing but an algorithm sampling from 

1p^{si,S2, ...,St) 



, Tl=2 



Un=2 r^ixi:n G S„|w^_i) 



For part 2, we focus on establishing irreducibility and aperiodicity of the transition probability of this 
algorithm. We denote by Co the law of the Gibbs sampler to which assumption [2] applies and CpQ the law 
of the PG sampler using N particles. 

For any set U write 2^ for the power set of U and let B{Q) denote a cr-algebra on O. Let A x S x C G 
B{e) X 2-^^ X nLi2^^'^"^ be such that tt^ {d e A,Xi.,t G B, Si, St-i G C) > 0. It follows that 
tt{{9,Xi.t) (z a X B) > and then from irreducibility of the corresponding Gibbs sampler (Assumption [2]) 
there exists a finite j such that Cc{{d{j), Xi-tU)) & A x B) > 0. 

From the definition of the conditional DPF update, it is straightforward to check that, for any 6 ^ Q, 
N > 2, given any xi;t and for any time step, any particle which has positive weight immediately before 
resampling has a positive probability of surviving that resampling step. Thus, by an inductive argument 
in n, any point in the support of pg{xi-T\yi:T) has positive probability of being assigned a positive weight 



24 



at time T. It then follows from the above arguments that A x B is marginally an accessible set of the PG 
sampler for the same j: i.e. CpQ{{9{j), Xi-xiJ)) G Ax B) > 0. Furthermore, as the conditional DPF update 
corresponds to drawing from the conditional of tt^ given 9 and Xi-t, 

^PGiMj + 1)> XMj + 1), Si(j + 1), St{j + l))eAxBxC)>0 

and irreducibility follows. Furthermore, aperiodicity of the PG sampler holds by contradiction: if the PG 
sampler were periodic, then the Gibbs sampler would be too; this violates Assumption 
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